It is important in any data science project to define the objective as specifically as possible. Below, we work from the general to the specific; this will direct our analysis.
Section 1: Initial Steps
Section 2: Data Cleaning and Preparation, Feature Engineering
Section 3: EDA of all variables and binning
Section 4: Models
GBM is a type of Boosting Model. Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
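To make the stage-wise idea concrete, here is a toy gradient-boosting sketch for squared loss, using shallow scikit-learn regression trees as the weak learners (an illustration only, not the H2O implementation used later in this notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Stage-wise boosting: each new tree fits the residuals of the current ensemble.
learn_rate, n_trees = 0.1, 100
pred = np.full_like(y, y.mean())       # initial constant model
trees = []
for _ in range(n_trees):
    residual = y - pred                # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learn_rate * tree.predict(X)
    trees.append(tree)

print(round(np.mean((y - pred) ** 2), 4))  # training MSE shrinks as trees are added
```

For an arbitrary differentiable loss, the residual is replaced by the loss's negative gradient, which is what "gradient" boosting generalizes.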
Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, machine vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.
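As a minimal, self-contained illustration of a supervised feed-forward network (using scikit-learn's MLPClassifier as a stand-in; the H2O deep learning estimator is what this notebook actually uses later):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary-classification data, purely illustrative.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1234)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1234)

# Two hidden layers of 32 ReLU units each, trained by backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), activation='relu',
                    max_iter=500, random_state=1234)
mlp.fit(X_tr, y_tr)
print(round(mlp.score(X_te, y_te), 3))
```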
H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more. H2O also has an industry leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities.
# Import all packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import (classification_report, confusion_matrix, roc_curve, auc,
                             roc_auc_score, accuracy_score, average_precision_score,
                             precision_recall_curve)
import plotly
import plotly.express as px
from imblearn.datasets import make_imbalance
import pylab as pl
from collections import Counter
# Read the data
df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 2 EDA/XYZloan_default_selected_vars.csv')
df.head(2)
print("Number of rows and columns in the dataset:")
df.shape
# Check basic statistics
print("Basic statistics of the columns are as follows:")
df.describe()
AP006
df['AP006'].hist()
df['AP006'].value_counts()
loan_default
# Check the target variable column
print("The number of 0's and 1's are:")
print(df['loan_default'].value_counts())
df['loan_default'].hist()
#df.info()
The columns Unnamed: 0, Unnamed: 0.1, and id are row identifiers that carry no predictive information. They need to be dropped.
AP005 is a date-time column, which cannot be used directly for prediction. A raw date-time field behaves like an ID column: almost every value is unique, so it effectively becomes a unique identifier for each record, much like using the 'id' field in a decision tree. I will instead derive year, month, day, weekday, etc. from this field. In some models we may use 'year' as a variable just to explain any special volatility in the past, but we will never use the raw date-time field as a predictor.
The following columns will also be dropped: TD025, TD026, TD027, TD028, CR012, TD029, TD044, TD048, TD051, TD054, TD055, TD061, TD062.
AP002: Gender
AP003: Education Code
AP004: Loan Term
AP006: OS Type
AP007: Application City Level
AP008: Flag if City not Application City
AP009: Binary format
MB007: Mobile Brands/type
Convert AP005 to the relevant formats of Year, Month, Day
df['AP005'] = pd.to_datetime(df['AP005'])
# Create 4 new columns
df['Loan_app_day_name'] = df['AP005'].dt.day_name()
df['Loan_app_month'] = df['AP005'].dt.month_name()
df['Loan_app_time'] = df['AP005'].dt.time
df['Loan_app_day'] = df['AP005'].dt.day
# Drop old column
df = df.drop(columns = ['AP005'])
df.head(2)
# Cast all categorical columns to object dtype in one loop
for col in ['AP002', 'AP003', 'AP004', 'AP006', 'AP007', 'AP008', 'AP009',
            'CR015', 'MB007', 'Loan_app_day_name', 'Loan_app_month',
            'Loan_app_time', 'Loan_app_day']:
    df[col] = df[col].astype('object')
df = df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1', 'id', 'TD025', 'TD026', 'TD027', 'TD028', 'CR012','TD029', 'TD044', 'TD048', 'TD051', 'TD054', 'TD055', 'TD061', 'TD062'])
df.head(2)
Per the variable descriptions, all the following columns are counts, lengths, or day counts. Hence, negative values such as -999, -99, -98, -1, etc. are misread NAs and need to be converted back to NaN.
features_nan = ['AP001',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005']
# Define a function to convert negatives to NaN
def convert_to_nan(var):
    # .loc avoids chained assignment, which may silently fail to write back
    df.loc[df[var] < 0, var] = np.nan
for i in features_nan:
convert_to_nan(i)
# Verify that the negatives are gone
print("The minimum now stands at 0 for most of the columns, verifying the mis-represented values are gone.")
df[features_nan].describe()
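As a side note, the same conversion can be done without a loop or chained indexing via `DataFrame.mask`; a minimal sketch on toy data (the column names here are just for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'TD001': [3, -999, 5], 'CR004': [-99, 2, -1]})
cols = ['TD001', 'CR004']
# Replace every negative sentinel with NaN in one vectorized pass.
toy[cols] = toy[cols].mask(toy[cols] < 0)
print(toy['TD001'].min())  # negatives are gone
```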
Multivariate imputer that estimates each feature from all the others. A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.
The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
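A minimal sketch of the round-robin behavior on a toy frame (illustrative data only; the real imputation on our features follows):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

toy = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                    'b': [2.0, 4.0, 6.0, 8.0]})
# Each feature with missing values is regressed on the others, iteratively.
imp_toy = IterativeImputer(initial_strategy='median', random_state=0)
filled = pd.DataFrame(imp_toy.fit_transform(toy), columns=toy.columns)
print(filled)  # the missing 'a' is predicted from 'b' (close to 3.0, since a ≈ b/2)
```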
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
df_2 = df[features_nan]
# Verify
df_2.head(3)
imp = IterativeImputer(missing_values=np.nan, sample_posterior=False,
max_iter=10, tol=0.001,
n_nearest_features=None, initial_strategy='median')
imp.fit(df_2)
imputed_data_median = pd.DataFrame(data=imp.transform(df_2),
                                   columns=features_nan,
                                   dtype='int')
imputed_data_median.head(3)
Convert CR009 to a category variable and bin appropriately
df['CR009'] = pd.cut(x=df['CR009'], bins=[-1, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1500000])
df = df.astype({'CR009':'object'})
df.CR009.value_counts()
corr = df[['loan_default', 'AP001', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024']].corr()
f,ax = plt.subplots(figsize=(18,12))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
# Remove one feature from any pair with correlation above 0.7
# However, since H2O handles multicollinearity well, I will not remove these variables
corr_var_drop1 = ['TD005', 'TD022', 'TD006', 'TD009', 'TD013', 'TD023', 'TD010', 'TD014']
Next I examine the remaining variables as a group, since they are all call-detail (CD) data.
filter_col = [col for col in df if col.startswith('CD')]
filter_col.append('loan_default')
corr = df[filter_col].corr()
f,ax = plt.subplots(figsize=(21,21))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
# Remove one feature from any pair with correlation above 0.7
# However, since H2O handles multicollinearity well, I will not remove these variables
corr_var_drop2 = ['CD173', 'CD172', 'CD170', 'CD169', 'CD167', 'CD166', 'CD164', 'CD162',
'CD137', 'CD136', 'CD135', 'CD133', 'CD132', 'CD131', 'CD117', 'CD118',
'CD120', 'CD121', 'CD123', 'CD114', 'CD113', 'CD108', 'CD107', 'CD106',
'CD101', 'CD072']
df_bin = df.copy(deep = True)
df_bin.head(2)
# Write a function and loop through
def binning(var):
df_bin[var + '_bin'] = pd.qcut(df_bin[var],15,duplicates='drop').values.add_categories("NoData")
df_bin[var + '_bin'] = df_bin[var + '_bin'].fillna("NoData").astype(str)
df_bin[var + '_bin'].value_counts(dropna=False)
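To see what this function does, here is the same qcut-plus-'NoData' pattern on a toy series (illustrative values): tied low values collapse duplicate bin edges, and NaNs land in their own 'NoData' category.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 0, 1, 2, 3, 4, 5, np.nan, np.nan])
binned = pd.qcut(s, 4, duplicates='drop')   # tied zeros drop a duplicate quartile edge
binned = binned.cat.add_categories('NoData').fillna('NoData').astype(str)
print(binned.value_counts(dropna=False))    # three bins plus a 'NoData' bucket of 2
```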
features = ['AP001', # 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
# 'AP008', 'AP009',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
#'CR009', 'CR015',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005'
# 'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
# 'Loan_app_day'
]
for i in features:
binning(i)
# View the bins of some variables
print(df_bin['TD001_bin'].value_counts(dropna=False))
print(df_bin['TD022_bin'].value_counts(dropna=False))
% Y by X for all the numerical columns
The 'mean' column represents the '% Y by X'.
def plot_X_and_Y(var):
z = df_bin.groupby(var + '_bin')['loan_default'].agg(['count','mean']).reset_index()
z['count_pcnt'] = z['count']/z['count'].sum()
x = z[var + '_bin']
y_mean = z['mean']
count_pcnt = z['count_pcnt']
ind = np.arange(0, len(x))
width = .5
fig = plt.figure(figsize=(16,4))
plt.subplot(121)
plt.bar(ind, count_pcnt, width, color='r')
#plt.ylabel('X')
plt.title(var + ' Distribution')
plt.xticks(ind,x.tolist(), rotation=45)
plt.subplot(122)
plt.bar(ind, y_mean, width, color='b')
#plt.ylabel('Y by X')
plt.xticks(ind,x.tolist(), rotation=45)
plt.tight_layout()
plt.title('Response mean by ' + var)
plt.show()
#for i in features:
# plot_X_and_Y(i)
% Y by X for all the categorical columns
The 'mean' column represents the '% Y by X'.
features_2 = ['AP002', 'AP003', 'AP004', 'AP006', 'AP007', 'AP008', 'AP009',
'CR009','CR015', 'MB007', 'Loan_app_day_name', 'Loan_app_month',
'Loan_app_day'
]
def plot_X_and_Y_cat(var):
z = df_bin.groupby(var)['loan_default'].agg(['count','mean']).reset_index()
z['count_pcnt'] = z['count']/z['count'].sum()
x = z[var]
y_mean = z['mean']
count_pcnt = z['count_pcnt']
ind = np.arange(0, len(x))
width = .5
fig = plt.figure(figsize=(16,4))
plt.subplot(121)
plt.bar(ind, count_pcnt, width, color='r')
plt.ylabel('X')
plt.title(var + ' Distribution')
plt.xticks(ind,x.tolist(), rotation=45)
plt.subplot(122)
plt.bar(ind, y_mean, width, color='b')
plt.ylabel('Y by X')
plt.xticks(ind,x.tolist(), rotation=45)
plt.tight_layout()
plt.title('Response mean by ' + var)
plt.show()
for i in features_2:
plot_X_and_Y_cat(i)
From the above graphs, the following variables do not seem important, as they show no pattern, trend, or curve on the '% Y by X' graph:
Loan_app_day_name
df_count = df['AP006'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP006 - OS Type','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP006 - OS Type', y = 'Count', color = 'AP006 - OS Type',
width=600, height=400,
title = "Distribution of OS type")
fig.show()
df_count = df['AP002'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP002 - Gender','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP002 - Gender', y = 'Count', color = 'AP002 - Gender',
width=600, height=400,
title = "Distribution of Gender")
fig.show()
df_count = df['AP003'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP003 - Education','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP003 - Education', y = 'Count', color = 'AP003 - Education',
width=600, height=400,
title = "Distribution of Education")
fig.show()
fig = px.box(df, x="TD001",width=1000, height=500,
title = "Distribution of TD001 - TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
fig = px.box(df, x="MB005",width=1000, height=500,
title = "Distribution of MB005")
fig.show()
fig = px.box(df, x="AP007", y="TD001",width=900, height=400,
color = "AP002",
title = "The Distribution of Level Application City by TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
fig = sns.pairplot(df[['AP002', 'AP003', 'AP004']],
hue= 'AP004')
fig
# Overwrite the NA-value columns with the previously imputed values
df[features_nan] = imputed_data_median
df.head(2)
df.isnull().sum().sum()
import h2o
h2o.init()
from h2o.estimators.gbm import H2OGradientBoostingEstimator
train, test = train_test_split(df, test_size=0.25, random_state=1234)  # 75% train / 25% test
# Convert to a h2o dataframe for computation
df_hex = h2o.H2OFrame(df)
train_hex = h2o.H2OFrame(train)
test_hex = h2o.H2OFrame(test)
# This test_hex will be used all throughout the models
# Selecting all predictor variables
predictors = ['AP001', 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
'AP008', 'AP009', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR009', 'CR015', 'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005',
'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
'Loan_app_day']
target = 'loan_default'
len(predictors)
len(df.columns.to_list())
The overall strategy is to test more trees with a smaller learning rate. The hyperparameters for tuning are the following:
The learning rate, a value between 0 and 1, controls how strongly each tree's correction is applied from one stage to the next. A small learning rate results in long computation times, while a large learning rate can keep the system from settling down. It is efficient to let the learning rate decay over time, and H2O exposes this through the "learn_rate_annealing" hyperparameter.
"Annealing", in materials science, describes heating a material and then cooling it down slowly. In GBM, a common way to decay the learning rate is "step decay", which reduces the rate by some factor every few iterations or epochs; a typical schedule halves the learning rate every 5 epochs. I will use the following set of hyperparameters for training and finding the best GBM model.
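The step-decay schedule described above is easy to state explicitly; a small sketch with illustrative numbers (not the per-tree multiplicative annealing H2O applies internally):

```python
# Step decay: halve the learning rate every 5 epochs (illustrative schedule).
def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=5):
    return lr0 * drop ** (epoch // epochs_per_drop)

print([round(step_decay(0.1, e), 4) for e in (0, 4, 5, 10)])  # [0.1, 0.1, 0.05, 0.025]
```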
from h2o.grid.grid_search import H2OGridSearch
grid_search_gbm = H2OGradientBoostingEstimator(
stopping_rounds = 15,
col_sample_rate = 0.70,
sample_rate = 0.70,
seed = 1234)
hyper_params = {
'learn_rate':[0.01, 0.02, 0.03, 0.04, 0.05, 0.1],
'max_depth':[2, 3, 4, 5, 6, 7, 8],
'ntrees':[200, 300, 350, 400, 450, 500]}
grid = H2OGridSearch(grid_search_gbm, hyper_params,
grid_id = 'depth_grid',
search_criteria={'strategy': "Cartesian"})
#Train grid search
grid.train(x = predictors,
y = target,
training_frame = train_hex,
validation_frame = test_hex)
grid_sorted = grid.get_grid(sort_by='rmse',decreasing=False)
grid_sorted1 = grid.get_grid(sort_by='r2',decreasing=True)
print(grid_sorted)
print(grid_sorted1)
best_gbm = grid_sorted.models[0]
print(best_gbm)
As per the R² and RMSE values displayed above, the optimal parameters for the GBM model are:
I will now build the GBM Model based on these optimum hyper parameters.
gbm_v1 = H2OGradientBoostingEstimator(
model_id = 'gbm_v1',
ntrees = 350,
nfolds = 10,
min_rows = 100,
max_depth = 5,
learn_rate = 0.02,
stopping_rounds = 0,
seed=1234)
gbm_v1.train(predictors,target,training_frame=train_hex, validation_frame = test_hex)
var_imps = pd.DataFrame(gbm_v1.varimp(), columns = ['Variable', 'Relative_Importance',
'Scaled_Importance', 'Percentage'])
The model has 75 features in total. After running the initial model with all features, I tried many different combinations to get the best LIFT score, and the LIFT is best with all features included. So I will not perform variable-importance-based selection.
#predictors = var_imps['Variable'].head(30).to_list()
# target = 'loan_default'
#gbm_v2 = H2OGradientBoostingEstimator(
# model_id = 'gbm_v2',
# ntrees = 300,
# nfolds = 10,
# min_rows = 100,
# max_depth = 10,
# learn_rate = 0.01,
# stopping_rounds = 10,
# seed = 1234)
#gbm_v2.train(predictors,target,training_frame=train_hex)
var_imps = pd.DataFrame(gbm_v1.varimp(), columns = ['Variable', 'Relative_Importance',
'Scaled_Importance', 'Percentage'])
print()
print("The 10 most important features are: ")
print()
var_imps.head(10)
def VarImp(model_name,m_name):
# plot the variable importance
plt.rcdefaults()
variables = model_name._model_json['output']['variable_importances']['variable']
y_pos = np.arange(len(variables))
fig, ax = plt.subplots(figsize = (8,16))
scaled_importance = model_name._model_json['output']['variable_importances']['scaled_importance']
ax.barh(y_pos,scaled_importance,align='center',color='green')
ax.set_yticks(y_pos)
ax.set_yticklabels(variables)
ax.invert_yaxis()
ax.set_xlabel('Scaled Importance')
ax.set_title(m_name + ' Variable Importance in Decreasing Order')
plt.show()
VarImp(gbm_v1, 'Gradient Boosting GBM')
Use the original test_hex set for prediction
def actual_predict(model,test_hex,target):
y_pred = model.predict(test_hex).as_data_frame()
y_actual = test_hex[target].as_data_frame()
df_actual_predict = pd.concat([y_actual,y_pred],axis=1)
df_actual_predict.columns = ['actual','pred']
return(df_actual_predict)
GBM_actual_predict = actual_predict(gbm_v1,test_hex,target)
GBM_actual_predict.head()
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was developed for operators of military radar receivers, which is why it is so named.
dd = GBM_actual_predict
GBM_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
GBM_roc_auc_value
def gains_table(df_actual_predict):
df_actual_predict = df_actual_predict.sort_values(by='pred',ascending=False)
df_actual_predict['row_id'] = range(0,0+len(df_actual_predict))
df_actual_predict['decile'] = (df_actual_predict['row_id'] / (len(df_actual_predict)/10)).astype(int)
df_actual_predict.loc[df_actual_predict['decile'] == 10, 'decile'] = 9
# Create gains table
gains = df_actual_predict.groupby('decile')['actual'].agg(['count','sum'])
gains.columns = ['count','actual']
gains['non_actual'] = gains['count'] - gains['actual']
gains['cum_count'] = gains['count'].cumsum()
gains['cum_actual'] = gains['actual'].cumsum()
gains['cum_non_actual'] = gains['non_actual'].cumsum()
gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
gains['if_random'] = np.max(gains['cum_actual']) /10
gains['if_random'] = gains['if_random'].cumsum()
gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
gains['K_S'] = np.abs( gains['percent_cum_actual'] - gains['percent_cum_non_actual'] ) * 100
gains['gain'] = (gains['cum_actual'] / gains['cum_count']*100).round(2)
return(gains)
GBM_gains = gains_table(GBM_actual_predict)
GBM_gains
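To make the decile and lift arithmetic concrete, here is a hand-checkable toy with fabricated numbers: 100 scored records sorted by predicted score, 20 actual defaults, all landing in the top two deciles.

```python
import numpy as np
import pandas as pd

# 100 records sorted by score; the 20 actual defaults sit in the top 20 rows.
toy = pd.DataFrame({'actual': [1] * 20 + [0] * 80})
toy['decile'] = np.arange(100) // 10
g = toy.groupby('decile')['actual'].agg(['count', 'sum'])
g['cum_actual'] = g['sum'].cumsum()
g['if_random'] = (g['sum'].sum() / 10) * np.arange(1, 11)  # 2, 4, ..., 20
g['lift'] = g['cum_actual'] / g['if_random']
print(g['lift'].iloc[0])  # top decile captures 10 defaults vs 2 expected: lift 5.0
```

By the last decile all defaults are captured, so cumulative lift always falls back to 1.0.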
def ROC_PR(df_actual_predict):
print('')
print(' * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
print('')
print(' * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
print('')
print(' * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
print('')
# ROC
roc_auc_value = roc_auc_score(df_actual_predict['actual'],df_actual_predict['pred'])
fpr, tpr, _ = roc_curve(df_actual_predict['actual'],df_actual_predict['pred'])
roc_auc = auc(fpr,tpr)
lw=2
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.plot(fpr,tpr, color='darkorange',lw=lw,label='AUC = %0.4f' %roc_auc_value)
plt.plot([0,1],[0,1], color='navy',lw=lw,linestyle='--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: AUC={0:0.4f}'.format(roc_auc_value))
plt.legend(loc='lower right')
# Precision-Recall
plt.subplot(1,2,2)
average_precision = average_precision_score(df_actual_predict['actual'],df_actual_predict['pred'])
precision, recall, _ = precision_recall_curve(df_actual_predict['actual'],df_actual_predict['pred'])
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall,precision,step='post',alpha=0.2,color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
ROC_PR(GBM_actual_predict)
I see that in the GBM models, the result is satisfactory.
In our dataset, the minority class represents 19.3%, or around one-fifth of the total data, and the majority class represents 80.7%, or around four-fifths, i.e. roughly 4 times the minority class. The underlying assumption for the stable results is that this is not a bad class distribution: 20% of the data in one class is fine for building a stable model. If the majority class were more than 5-6 times the size of the minority class, that would be better handled by the balance_classes parameter in the model.
I also observed that for this problem, the balance_classes parameter did not make any difference.
Also, for the train/test split, I found that a 75:25 split works best for all my models. I tried other combinations, but 75% train gives me the best Lift score of 2.08 in both models. So I will continue with the same split ratio.
The optimum hyper-parameters for the GBM model built above are:
Please see below for the meaningful business insights:
The H2O package is very effective and efficient for building a machine learning model to predict loan default. It is also very handy for displaying variable importance, handling correlations, and dummy-coding categorical variables.
Gains table and Lift:
For the final model, built after tuning all the models on various values of each parameter and finally tuning all the hyperparameters together, the highest Lift score I obtained is 2.10, which is good by industry standards; a Lift score above 2 is generally considered acceptable.
ROC and AUC: The area under the ROC curve (AUC) assesses overall classification performance. But AUC does not place more emphasis on one class over the other, so it does not reflect the minority class well. Precision-Recall (PR) curves are more informative than ROC when dealing with highly skewed datasets: they plot precision vs. recall (the true positive rate). Because precision is directly influenced by class imbalance, PR curves are better at highlighting differences between models on highly imbalanced data sets.
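The contrast is easy to demonstrate on an imbalanced toy problem, where ROC-AUC typically comes out higher than average precision (a sketch with a synthetic ~5% minority class; all values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# ~5% positives plus a little label noise, so the classifier is imperfect.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=1234)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1234)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(round(roc_auc_score(y_te, p), 3), round(average_precision_score(y_te, p), 3))
```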
However, neither represents the results perfectly: one doesn't reflect the minority class well, and the other is sensitive to imbalanced data. Hence we use the H2O package, which takes care of these problems for us. We get an AUC of 0.69 and a PR of 0.34, which is acceptable, but I will try to improve them further in my following models.
The major outcome of this exercise is that GBM is a good model for predicting loan default on the given data. However, we should not dismiss other strong boosting models such as XGBoost, or AutoML; these might provide better results as well.
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
# Selecting all predictor variables
predictors = ['AP001', 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
'AP008', 'AP009', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR009', 'CR015', 'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005',
'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
'Loan_app_day']
target = 'loan_default'
len(predictors)
len(df.columns.to_list())
DL_modl1 = H2ODeepLearningEstimator(
model_id = 'DL_modl1',
distribution = "tweedie",
activation = "RectifierWithDropout",
hidden=[75, 60, 30, 10, 30, 60, 75],
input_dropout_ratio = 0.2,
sparse=True,
l1=1e-5,
epochs=200,
overwrite_with_best_model = True,
stopping_rounds = 0,
balance_classes = False)
DL_modl1.train(
x=predictors,
y=target,
training_frame=train_hex,
validation_frame=test_hex)
I tried feature selection based on variable importance, but it degraded model performance, so I will use all the features.
var_imps = pd.DataFrame(DL_modl1.varimp(), columns = ['Variable', 'Relative_Importance',
'Scaled_Importance', 'Percentage'])
print()
print("The 10 most important features are: ")
print()
var_imps.head(10)
def VarImp(model_name,m_name):
# plot the variable importance
plt.rcdefaults()
variables = model_name._model_json['output']['variable_importances']['variable']
y_pos = np.arange(len(variables))
fig, ax = plt.subplots(figsize = (8,40))
scaled_importance = model_name._model_json['output']['variable_importances']['scaled_importance']
ax.barh(y_pos,scaled_importance,align='center',color='green')
ax.set_yticks(y_pos)
ax.set_yticklabels(variables)
ax.invert_yaxis()
ax.set_xlabel('Scaled Importance')
ax.set_title(m_name + ' Variable Importance in Decreasing Order')
plt.show()
VarImp(DL_modl1, 'Deep Learning DL')
Use the original test_hex set for prediction
DL_actual_predict = actual_predict(DL_modl1,test_hex,target)
DL_actual_predict.head()
dd = DL_actual_predict
DL_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
DL_roc_auc_value
DL_gains = gains_table(DL_actual_predict)
DL_gains
ROC_PR(DL_actual_predict)
I see that in the Deep Learning model, the result is satisfactory.
In our dataset, the minority class represents 19.3%, or around one-fifth of the total data, and the majority class represents 80.7%, or around four-fifths, i.e. roughly 4 times the minority class. The underlying assumption for the stable results is that this is not a bad class distribution: 20% of the data in one class is fine for building a stable model. If the majority class were more than 5-6 times the size of the minority class, that would be better handled by balancing the classes.
Also, for the train/test split, I found that a 75:25 split works best for all my models. I tried other combinations, but 75% train gives me the best Lift score of 2.11 in both models. So I will continue with the same split ratio.
The optimum hyper-parameters for the Deep Learning model built above are:
Please see below for the meaningful business insights:
The H2O package is very effective and efficient for building a machine learning model to predict loan default. It is also very handy for displaying variable importance, handling correlations, and dummy-coding categorical variables.
Gains table and Lift:
For the final model, built after tuning all the models on various values of each parameter and finally tuning all the hyperparameters together, the highest Lift score I obtained is 1.9, which is reasonable by industry standards; a Lift score of around 2 is generally considered acceptable.
ROC and AUC: The area under the ROC curve (AUC) assesses overall classification performance. But AUC does not place more emphasis on one class over the other, so it does not reflect the minority class well. Precision-Recall (PR) curves are more informative than ROC when dealing with highly skewed datasets: they plot precision vs. recall (the true positive rate). Because precision is directly influenced by class imbalance, PR curves are better at highlighting differences between models on highly imbalanced data sets.
However, neither represents the results perfectly: one doesn't reflect the minority class well, and the other is sensitive to imbalanced data. Hence we use the H2O package, which takes care of these problems for us. We get an AUC of 0.66 and a PR of 0.30, which is acceptable, but I will try to improve them further in my following models.
The major outcome of this exercise is that Deep Learning is a good model for predicting loan default on the given data. However, we should not dismiss other strong models such as XGBoost, or AutoML; these might provide better results as well.
As per my model building analysis, I saw that h2o Random Forest (through Over Sampling) gave me the best Lift Score of 3.01, as was seen in the previous week's submission.
For this week, out of Gradient Boosting and Deep Learning models, after fine tuning all parameters, I find that GBM gives a better result, and the highest Lift Score is 2.10, from the GBM model.
So I suggest using either Random Forest or GBM for predicting Loan Defaults. However, other models such as xgboost, and Auto ML might also be explored.